26 Oct 2016

Statistical Computing

Optimization

  • Continuous model parameters
  • Meta or tuning parameters
  • Combinatorial structures, e.g. models

Simulation and Integration

  • Mean and variance, \(E f(X) = \int f(X) \, dP\)
  • Probabilities, \(P(X \in A) = \int \mathbf{1}(X \in A) \, dP\)
  • Marginalization, Bayesian posteriors
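The integrals above are typically approximated by Monte Carlo averages. A minimal sketch, assuming \(X \sim \mathcal{N}(0, 1)\), \(f(x) = x^2\) and \(A = \{|x| > 1.96\}\) (so the true values are 1 and roughly 0.05):

```r
## Monte Carlo approximations of E f(X) and P(X in A) for X ~ N(0, 1)
set.seed(1)
x <- rnorm(1e5)
mc_mean <- mean(x^2)            # estimates E X^2 = 1
mc_prob <- mean(abs(x) > 1.96)  # estimates P(|X| > 1.96) = 0.05
```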

Statistical Computing

R programming

  • Functions and environments
  • S3 object system
  • Rcpp
  • Vectorization and modularity

RStudio and tools

  • Profiling using profvis
  • Debugging
  • Benchmarking using microbenchmark

Statistical Computing

  • Assignment 1: Density estimators and smoothers can often be expressed as matrix multiplication. Special algorithms are faster. Optimization of tuning parameters must be considered.
  • Assignment 2: The EM-algorithm is an algorithm for optimizing the likelihood. Algorithms for computing the observed Fisher information must be considered as well.
  • Assignment 3: Rejection and importance sampling of univariate random variables are used for computing probabilities or integrals.
  • Assignment 4: Particle filters and MCMC are examples of multivariate simulation techniques for computing (conditional) expectations and marginal distributions.

Smoothing again

Degree 19 polynomial

The model matrix

intercept <- rep(1/sqrt(200), 200)
polylm <- lm(y ~ intercept + poly(x, 19) - 1, data = bivar)
X <- model.matrix(polylm)
X[1:5, 1:10]
   intercept poly(x, 19)1 poly(x, 19)2 poly(x, 19)3 poly(x, 19)4
1 0.07071068   -0.1218654    0.1557699   -0.1815656    0.2018003
2 0.07071068   -0.1206280    0.1510253   -0.1705062    0.1813153
3 0.07071068   -0.1194281    0.1464706   -0.1600514    0.1623497
4 0.07071068   -0.1181907    0.1418213   -0.1495450    0.1436961
5 0.07071068   -0.1169533    0.1372204   -0.1393151    0.1259391
  poly(x, 19)5 poly(x, 19)6 poly(x, 19)7 poly(x, 19)8 poly(x, 19)9
1  -0.21760990   0.22959357  -0.23816564   0.24365837  -0.24631459
2  -0.18448000   0.18066873  -0.17051317   0.15469871  -0.13394890
3  -0.15461499   0.13799690  -0.11381768   0.08361997  -0.04910543
4  -0.12605361   0.09861139  -0.06375853   0.02423082   0.01705362
5  -0.09966962   0.06362414  -0.02149210  -0.02267310   0.06478011

The model matrix columns as functions

The model matrix is (almost) orthogonal

image(Matrix(t(X) %*% X))

Estimation with orthogonal design

(t(X) %*% bivar$y)[1:10, 1]
   intercept poly(x, 19)1 poly(x, 19)2 poly(x, 19)3 poly(x, 19)4 
 -12.5674795   35.2385655   -0.8484913  -15.5873696    9.7383642 
poly(x, 19)5 poly(x, 19)6 poly(x, 19)7 poly(x, 19)8 poly(x, 19)9 
 -15.3473035   -5.1107729    2.1143983    0.8003312    0.3117874 
coef(polylm)[1:10]
   intercept poly(x, 19)1 poly(x, 19)2 poly(x, 19)3 poly(x, 19)4 
 -12.5674795   35.2385655   -0.8484913  -15.5873696    9.7383642 
poly(x, 19)5 poly(x, 19)6 poly(x, 19)7 poly(x, 19)8 poly(x, 19)9 
 -15.3473035   -5.1107729    2.1143983    0.8003312    0.3117874 

Estimation with orthogonal design

With an orthogonal design matrix the normal equations reduce to \[\hat{\beta} = X^T Y\] since \(X^T X = I\).

The predicted (or fitted) values are \(X X^T Y\) with smoother matrix \(S = X X^T.\)

With homogeneous variance \[\hat{\beta}_i \overset{\text{approx}}{\sim} \mathcal{N}(\beta_i, \sigma^2),\] and for \(\beta_i = 0\) we have \(P(|\hat{\beta}_i| \geq 1.96\sigma) \simeq 0.05.\)
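A sketch of estimation and thresholding with an orthonormal design. The bivar data is not reproduced here, so the data is simulated, and the noise level sigma is assumed known for the thresholding rule:

```r
## Thresholding with an orthonormal polynomial design (simulated data)
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)
X <- cbind(rep(1 / sqrt(n), n), poly(x, 19))   # orthonormal columns
betahat <- drop(t(X) %*% y)                    # betahat = X^T Y
sigma <- 0.5                                   # assumed known here
beta_thr <- ifelse(abs(betahat) >= 1.96 * sigma, betahat, 0)
smooth <- as.numeric(X %*% beta_thr)           # thresholded smoother
```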

Thresholding

Thresholding-based smoother

Discrete Fourier transform

Introducing \[x_{k,m} = \frac{1}{\sqrt{n}} e^{2 \pi i k m / n},\] then

\[\sum_{k=0}^{n-1} |x_{k,m}|^2 = 1\]

and for \(m_1 \neq m_2\) \[\sum_{k=0}^{n-1} x_{k,m_1}\overline{x_{k,m_2}} = 0\]
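Both identities are easy to check numerically, e.g. for \(n = 8\):

```r
## Numerical check of the orthonormality relations for n = 8
n <- 8
xkm <- function(k, m) exp(2 * pi * 1i * k * m / n) / sqrt(n)
k <- 0:(n - 1)
sum(abs(xkm(k, 3))^2)                  # equals 1 (up to rounding)
abs(sum(xkm(k, 1) * Conj(xkm(k, 2))))  # equals 0 (up to rounding)
```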

Discrete Fourier transform

Thus \(X = (x_{k,m})_{k,m}\) is an \(n \times n\) unitary matrix; \[X^*X = I\] where \(X^*\) is the conjugate transpose of \(X\).

\(\hat{\beta} = X^* Y\) is the discrete Fourier transform of \(Y\). It is the basis coefficients in the orthonormal basis given by \(X\); \[Y_k = \frac{1}{\sqrt{n}} \sum_{m=0}^{n-1} \hat{\beta}_m e^{2 \pi i k m / n}\]

or \(Y = X \hat{\beta}.\)

Discrete Fourier transform

X <- outer(0:199, 0:199, 
           function(k, m) exp(2 * pi * 1i * (k * m) / 200) / sqrt(200))
image(Matrix(abs(Conj(t(X)) %*% X)), useAbs = FALSE)

The matrix

Estimation

We can compute the estimate by matrix multiplication

betahat <- Conj(t(X)) %*% bivar$y # t(X) = X for Fourier bases
betahat[c(1, 2:4, 101, 200:198)]
[1] -12.5675+ 0.000i  -2.9576+25.186i   6.8961+14.840i   0.4994+ 2.137i
[5]  -0.7107- 0.000i  -2.9576-25.186i   6.8961-14.840i   0.4994- 2.137i

For real \(Y\) it holds that \(\hat{\beta}_0\) is real, and the symmetry \[\hat{\beta}_{n-m} = \hat{\beta}_m^*\] holds for \(m = 1, \ldots, n - 1\). (For \(n\) even, \(\hat{\beta}_{n/2}\) is real too).
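The symmetry can be verified directly with fft on simulated real data (indices are shifted by one because R is 1-based):

```r
## Conjugate symmetry of the DFT of a real vector
set.seed(1)
n <- 16
y <- rnorm(n)
bh <- fft(y) / sqrt(n)   # bh[1 + m] is betahat_m
m <- 1:(n - 1)
all(abs(bh[1 + n - m] - Conj(bh[1 + m])) < 1e-12)  # TRUE
```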

Modulus distribution

If \(\beta_m = 0\) for an \(m \neq 0, n/2\) and \(Y \sim \mathcal{N}(X\beta, \sigma^2 I_n)\), then
\[(\mathrm{Re}(\hat{\beta}_m), \mathrm{Im}(\hat{\beta}_m))^T \sim \mathcal{N}\left(0, \frac{\sigma^2}{2} I_2\right),\]

hence \[|\hat{\beta}_m|^2 = \mathrm{Re}(\hat{\beta}_m)^2 + \mathrm{Im}(\hat{\beta}_m)^2 \sim \frac{\sigma^2}{2} \chi^2_2,\] that is, \(P(|\hat{\beta}_m| \geq 1.73 \sigma) = 0.05.\)
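A small simulation, assuming pure Gaussian noise (\(\beta = 0\), \(\sigma = 1\)), illustrates that the exceedance probability is close to 0.05:

```r
## For pure noise, |betahat_m|^2 is exponential with mean sigma^2,
## so P(|betahat_m| >= 1.73 sigma) = exp(-1.73^2), roughly 0.05
set.seed(1)
n <- 128
exceed <- replicate(2000, {
  bh <- fft(rnorm(n)) / sqrt(n)
  mean(abs(bh[2:(n / 2)]) >= 1.73)  # m = 1, ..., n/2 - 1
})
mean(exceed)  # close to 0.05
```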

Thresholding Fourier

Cosine and sine expansion

The coefficients are not independent (remember the symmetry), and one can alternatively consider\(^{(1)}\) \[\hat{\gamma}_m = \sqrt{2} \mathrm{Re}(\hat{\beta}_m) \quad \text{and} \quad \hat{\gamma}_{n' + m} = - \sqrt{2} \mathrm{Im}(\hat{\beta}_m)\] for \(1 \leq m < n / 2\). Here \(n' = \lfloor n / 2 \rfloor\).

These coefficients are the coefficients in a real cosine and sine basis\(^{(2)}\) expansion, and they are i.i.d. \(\mathcal{N}(0, \sigma^2)\) distributed.

\({}^{(1)}\)Let \(\hat{\gamma}_0 = \hat{\beta}_0\), and \(\hat{\gamma}_{n/2} = \hat{\beta}_{n/2}\) for \(n\) even.
\({}^{(2)}\) The basis vectors are \(\sqrt{2/n} \cos(2\pi k m / n)\) and \(\sqrt{2/n} \sin(2\pi k m / n)\).

Thresholding Fourier

Fourier based smoother by thresholding (red)

Using the five largest coefficients. Blue is the best degree six polynomial fit.

Discrete Fourier transform

What is the point?

The point is that the discrete Fourier transform can be computed via the fast Fourier transform (FFT), which has an \(O(n\log(n))\) time complexity.

The FFT works optimally for \(n = 2^p\).

fft(bivar$y)[1:4] / sqrt(200)
[1] -12.567+ 0.000i  -2.958+25.186i   6.896+14.840i   0.499+ 2.137i
betahat[1:4]
[1] -12.567+ 0.000i  -2.958+25.186i   6.896+14.840i   0.499+ 2.137i
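The agreement can also be checked without the bivar data; the matrix form costs \(O(n^2)\) per transform against \(O(n \log n)\) for fft:

```r
## The O(n^2) matrix transform and fft agree on simulated data
set.seed(1)
n <- 256
y <- rnorm(n)
X <- outer(0:(n - 1), 0:(n - 1),
           function(k, m) exp(2 * pi * 1i * k * m / n) / sqrt(n))
bh_mat <- drop(Conj(t(X)) %*% y)  # O(n^2)
bh_fft <- fft(y) / sqrt(n)        # O(n log n)
max(abs(bh_mat - bh_fft))         # essentially zero
```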

Other bases

In addition to the Fourier basis and polynomials there are:

  • Wavelets. Families of orthogonal functions obtained by dilation (scaling) and translation. Some have local support, some are smooth. It's difficult to get both. Efficient \(O(n)\) computation of the wavelet transform.
  • Splines. In particular B-splines that have local support and give sparse matrices.
  • Several others I don't know about.

For any particular basis there might be computational shortcuts like sparsity, FFT or the wavelet transform.
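As a sketch of the sparsity shortcut, a B-spline design matrix from the base splines package has at most ord nonzeros per row due to the local support:

```r
## Cubic B-spline design matrix: local support gives many exact zeros
library(splines)
x <- seq(0, 1, length.out = 200)
knots <- c(rep(0, 3), seq(0, 1, length.out = 12), rep(1, 3))
B <- splineDesign(knots, x, ord = 4)  # 200 x 14, at most 4 nonzeros per row
mean(B == 0)                          # large fraction of exact zeros
```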

Running means again

runMean <- function(y, k) {
  n <- length(y)
  m <- floor((k - 1) / 2)
  k <- 2 * m + 1                 # force an odd window length
  y <- y / k
  s <- rep(NA, n)
  s[m + 1] <- sum(y[1:k])        # first complete window
  for(i in (m + 1):(n - m - 1))  # update by sliding the window one step
    s[i + 1] <- s[i] - y[i - m] + y[i + 1 + m]
  s
}
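The loop can also be avoided with cumsum. This is a sketch of an alternative implementation (the name runMeanCumsum is made up) with the same NA boundary convention as filter:

```r
## Running mean via cumulative sums; window sum is cs[i + m + 1] - cs[i - m]
runMeanCumsum <- function(y, k) {
  n <- length(y)
  m <- floor((k - 1) / 2)
  k <- 2 * m + 1
  cs <- c(0, cumsum(y))          # cs[j + 1] = sum(y[1:j])
  s <- rep(NA_real_, n)
  i <- (m + 1):(n - m)
  s[i] <- (cs[i + m + 1] - cs[i - m]) / k
  s
}
```

It agrees with filter(y, rep(1/k, k)) up to numerical rounding.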

Running means and filter

The filter function does the same.

y <- rnorm(7)
rbind(runMean(y, k = 3),
c(filter(y, rep(1/3, 3))))
     [,1]  [,2]   [,3]   [,4]   [,5]   [,6] [,7]
[1,]   NA 0.177 -0.279 -0.282 0.2454 0.3941   NA
[2,]   NA 0.177 -0.279 -0.282 0.2454 0.3941   NA
rbind(runMean(y, k = 5),
c(filter(y, rep(1/5, 5))))
     [,1] [,2]   [,3]     [,4]     [,5] [,6] [,7]
[1,]   NA   NA 0.2023 -0.01549 -0.03353   NA   NA
[2,]   NA   NA 0.2023 -0.01549 -0.03353   NA   NA

Benchmarking

A benchmark comparison between runMean and filter gives the following table.

                              expr   mean  median
1        runMean(y[1:512], k = 11)  882.8  845.66
2       runMean(y[1:1024], k = 11) 1777.8 1696.21
3       runMean(y[1:2048], k = 11) 3590.5 3377.31
4       runMean(y[1:4196], k = 11) 7189.2 6941.83
5  filter(y[1:512], rep(1/11, 11))  127.6   99.19
6 filter(y[1:1024], rep(1/11, 11))  145.1  122.87
7 filter(y[1:2048], rep(1/11, 11))  210.5  186.35
8 filter(y[1:4196], rep(1/11, 11))  333.5  328.52

Thus filter is substantially faster than runMean.

Linear algebra approach

library(Matrix)
bandSparse(5, 5, seq(-1, 1))
5 x 5 sparse Matrix of class "ngCMatrix"
              
[1,] | | . . .
[2,] | | | . .
[3,] . | | | .
[4,] . . | | |
[5,] . . . | |

Using matrix multiplication

K <- bandSparse(200, 200, seq(-2, 2))
weights <- c(1/3, 1/4, rep(1/5, 196), 1/4, 1/3)  # corrected weights at the boundary
weights <- c(NA, NA, rep(1/5, 196), NA, NA)      # or NAs at the boundary, as filter gives
p <- qplot(x, y, data = bivar)
p + geom_line(aes(y = as.numeric(K %*% bivar$y) * weights))

Benchmark

Many linear smoothers can be computed much faster than via a matrix multiplication of the smoother matrix and the observation vector, which has time complexity \(O(n^2)\).

If the smoother matrix is sparse, however, matrix multiplication can be much faster.

We will present some benchmark comparisons below. First we compare the runtime for the matrix multiplication as.numeric(K %*% bivar$y) * weights using a sparse matrix (as above) with the runtime using a dense matrix. The dense matrix is given as Kdense = as.matrix(K). These runtimes are compared to using filter. In all computations, \(k = 5\).

Benchmark

Benchmark, comments

The difference in slopes between dense and sparse matrix multiplication should be noted. This is the difference between \(O(n^2)\) and \(O(n)\) runtime.

The runtime for the dense matrix multiplication will not change with \(k\). For the other two it will increase (linearly) with increasing \(k\).

For smoothing only once with a given smoother matrix, the time to construct the matrix should also be taken into account for a fair comparison with filter. We do this next, where we also compare with runMean.

The function bandSparse is not optimized for the specific running mean banded matrix, and I have written a faster C++ function for this job.

Fast band pattern matrix

library(Rcpp)
cppFunction(
  'List fastBand(int n, int k) {
  // number of nonzeros in the band pattern: n(2k + 1) - k(k + 1)
  int N = (2 * k + 1) * (n - 2 * k) + 3 * k * k + k;
  int iter = 0;
  IntegerVector i(N), p(n + 1);  // CSC row indices and column pointers
  for(int col = 0; col < n; ++col) {
    p[col] = iter;
    // rows within the band for this column, clipped at the boundaries
    for(int r = std::max(col - k, 0); r < std::min(col + k + 1, n); ++r) {
      i[iter] = r;
      iter++;
    }
  }
  p[n] = N;
  return List::create(_["i"] = i, _["p"] = p);
  }')

And the R function

bandSparseFast <- function(n, k) {
  n <- as.integer(n)
  k <- as.integer(k)
  tmp <- fastBand(n, k)
  new("ngCMatrix", 
      i = tmp$i, 
      p = tmp$p, 
      Dim = c(n, n))
}
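A pure R analogue of fastBand (the name bandSparseR is made up) is useful for checking the C++ version; it builds the same CSC slots directly:

```r
## Build the (i, p) slots of the band pattern in R and compare to bandSparse
library(Matrix)
bandSparseR <- function(n, k) {
  n <- as.integer(n)
  cols <- 0:(n - 1)
  i <- unlist(lapply(cols, function(col) max(col - k, 0):min(col + k, n - 1)))
  counts <- pmin(cols + k, n - 1) - pmax(cols - k, 0) + 1  # nonzeros per column
  p <- as.integer(c(0, cumsum(counts)))
  new("ngCMatrix", i = as.integer(i), p = p, Dim = c(n, n))
}
all(as.matrix(bandSparseR(7, 2)) == as.matrix(bandSparse(7, 7, seq(-2, 2))))
```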

Benchmark

Benchmark, comments

The construction of the sparse matrix turns out to take much more time than the matrix-vector multiplication. The runtime is still \(O(n)\), but the constant is roughly a factor of 16 larger than for filter.

With the faster construction of the sparse matrix, the constant is reduced to roughly a factor of 5 larger than for filter.

For small \(n\) there is some overhead from the constructor of the sparse matrix object, even for the faster algorithm.

Take-home points for sparse matrices

If you implement an algorithm (like a smoother) using linear algebra (e.g. a matrix-vector product) then sparse matrix numerical methods can be useful compared to dense matrix numerical methods.

The Matrix package for R implements sparse matrices.

Always use methods for constructing the sparse matrix that avoid dense intermediates (if possible).

Even sparse linear algebra cannot compete with optimized special purpose algorithms like filter or a C++ implementation of runMean.

Benefits of the matrix approach

The filter function works for kernels (weights) with equidistant data.

For non-equidistant data the sparse matrix approach could be a good solution if the kernel has local support.

One nice idea to investigate in Assignment 1 is how well a sparse matrix implementation performs, compared to an R and a C++ implementation, for kernel smoothing with non-equidistant data points and a kernel with local support.
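A rough sketch of that idea (not a solution to the assignment): build a sparse smoother matrix from the Epanechnikov kernel, which has support \([-h, h]\), on non-equidistant points. The neighbor search via outer is itself \(O(n^2)\) and is only for illustration.

```r
## Sparse kernel smoother for non-equidistant x (Epanechnikov kernel)
library(Matrix)
set.seed(1)
n <- 200
x <- sort(runif(n))                      # non-equidistant design points
h <- 0.05                                # bandwidth
d <- outer(x, x, "-")
idx <- which(abs(d) < h, arr.ind = TRUE)
w <- 0.75 * (1 - (d[idx] / h)^2)         # kernel weights for nearby pairs
S <- sparseMatrix(i = idx[, 1], j = idx[, 2], x = w, dims = c(n, n))
S <- Diagonal(x = 1 / rowSums(S)) %*% S  # normalize rows to sum to one
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
smooth <- as.numeric(S %*% y)
```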

Comparing output

qplot(1:200, as.numeric(K %*% bivar$y) * weights - 
        c(filter(bivar$y, rep(1/5, 5)))) +
  scale_y_continuous("Difference")

Comparing output

all(as.numeric(K %*% bivar$y) * weights == c(filter(bivar$y, rep(1/5, 5))))
[1] FALSE
all.equal(as.numeric(K %*% bivar$y) * weights, 
          c(filter(bivar$y, rep(1/5, 5))))
[1] TRUE
identical(as.numeric(K %*% bivar$y) * weights, 
          c(filter(bivar$y, rep(1/5, 5))))
[1] FALSE
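The reason is standard floating-point rounding; a minimal illustration:

```r
## Exact comparison fails under floating-point rounding
a <- 0.1 + 0.2
a == 0.3           # FALSE: a differs from 0.3 in the last bits
all.equal(a, 0.3)  # TRUE: comparison with a numerical tolerance
identical(a, 0.3)  # FALSE: bitwise comparison
```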

Exam

  • 25 min. in total, max 15 min. presentation.
  • Explore and experiment. There is no single correct solution.
  • Present results and final solutions …
  • … but also first solutions and failures.
  • Prioritize and focus on the important stuff.
  • Don't try to cover everything.
  • Don't rush the explanations. Explain code you show. Skip code you don't want to explain.
  • Introduce the theoretical context, but keep theory to a minimum.
  • More info on Absalon.